cellula

cellula is a simple R-based pipeline for single cell RNA-seq processing with a number of methods for integration and identity assignment.

cellula follows the practices outlined in the OSCA book, with some additional options for integration/batch effect correction methods.

As a one-stop solution, this package tends to make choices for the users, with the caveat that these choices follow either defaults or sensible implementations. However, this means that a certain degree of freedom is removed from the end user. This assumes that users who desire total control on the process (or granular specification of parameters) do not need cellula and would be more comfortable setting up their own analysis pipelines.

cellula exists to automate and share routine analyses the way I usually do them, and offer “quick and dirty” access for exploratory data analysis.

Where does the name “papplain” come from? it’s an inside joke with some friends and ex colleagues.

Why did you change it to “cellula”? one day I’d like to share this tool and I need a name that is not too dumb.

Table of Contents

Install

For the time being, clone the repo and install:

devtools::install("path/to/cloned/git")

cellula is dependency-heavy, which is not something I’m proud of, but makes sense considering this is a wrapper to a series of different analytical approaches.

The package will require a number of BioConductor and GitHub dependencies. You can install them as follows:

BiocManager::install(c("scran", "scuttle", "bluster", "scater", "batchelor", "DropletUtils", 
                      "AUCell", "harmony", "GSVA", "gdagstn/oveRlay", "UCell", 
                      "slingshot", "TSCAN"))

Usage

The pipelines in cellula assume that you are working with the output of CellRanger (or something similar) and you imported it into a SingleCellExperiment object (hereafter SCE) using DropletUtils::read10Xcounts(). This is relevant for gene identifiers, since the rowData slot of the SCE will have a “Symbol” and a “ID” column.

For demo purposes we can use a publicly available dataset, Segerstolpe et al. 2016 ref, which we retrieve using the scRNAseq package:

# BiocManager::install("scRNAseq") 
sce <- scRNAseq::SegerstolpePancreasData()
colnames(rowData(sce)) = c("Symbol", "ID")

Assuming that you have formed a SCE object containing the “individual” column that identifies different batches, you can run an integration pipeline as follows:

sce <- cellula(sce, name = "myproject", batch = "individual", 
                integration_method = "Harmony",
                verbose = TRUE, save_plots = TRUE)

The name argument defines the name of the folder that will be created to store files and plots. We set verbose = TRUE to print the progress of the pipeline. Setting save_plots = TRUE will create a few QC plots in the name/plots folder: total UMI, total genes detected, UMI x genes; optionally % MT, % Ribo and %MALAT1, total UMI x doublet class. Plots are separated according to whether the cells were discarded or not in the filtering step.

The cellula() function is a wrapper around a few modules or sub-pipelines that have different degrees of customization.

The scheme is:

cellula
    ├── Quality Control [QC]
    |    ├── run emptyDrops (optional) [QC/EMPTY]
    |    ├── score mito/ribo/malat1 subsets (optional)
    |    ├── filter out (optional)
    |    └── doublet finding (optional) [QC/DBL]
    ├── Normalization and dimensionality reduction [NOR]
    |    ├── pre-clustering
    |    ├── computing pooled factors
    |    ├── log-normalization (simple or multi-batch)
    |    ├── HVG finding (simple or multi-batch)
    |    ├── PCA
    |    └── UMAP
    ├── Integration [INT] (optional) - choose one method
    |    ├── fastMNN [INT/MNN]
    |    |    ├── integration
    |    |    └── UMAP
    |    ├── Seurat [INT/SEURAT]
    |    |    ├── conversion to Seurat
    |    |    ├── normalization and HVG finding
    |    |    ├── find integration anchors
    |    |    ├── integrate data
    |    |    ├── scale data
    |    |    ├── PCA
    |    |    └── UMAP
    |    ├── LIGER [INT/LIGER]
    |    |    ├── conversion to LIGER
    |    |    ├── normalization
    |    |    ├── HVG finding
    |    |    ├── scale data
    |    |    ├── NMF
    |    |    ├── quantile normalization
    |    |    └── UMAP
    |    ├── Harmony [INT/HARMONY]
    |    |    ├── Harmony matrix (on PCA)
    |    |    └── UMAP
    |    └── Regression [INT/regression]
    |         ├── regression on logcounts
    |         ├── PCA
    |         └── UMAP
    └── Cell type annotation [ANNO] (optional) - choose one method
              ├── Seurat AddModuleScore
              ├── ssGSEA
              ├── UCell
              └── AUCell
             

Most of the choices can be made around the integration method. cellula has implemented 5 methods: fastMNN and regressBatches from the batchelor package, Harmony, the CCA-based Seurat method, and non-negative matrix factorization (NMF) from LIGER through the rliger package.

LIGER and Seurat integration methods require an intermediate step where package-specific objects are created and some pre-processing steps are repeated again according to the best practices published by the authors of those packages.

Each step of the pipeline can be called independently on the object:

sce <- doQC(sce, name = "segerstolpe", batch = "individual", save_plots = TRUE)
sce <- doNormAndReduce(sce, name = "segerstolpe", batch = "individual")
sce <- integrateSCE(sce, batch = "individual", method = "Seurat")

Plotting

There are a few simple plotting functions in cellula:

  • plot_UMAP() to plot a UMAP (or any other 2D dimensional reduction)
  • plot_Coldata() to plot data from the colData slot as a boxplot, scatterplot or confusion matrix
  • plot_dots() to plot a dot plot of gene expression

plot_UMAP()

You can choose point color using the color_by argument, and facetting is supported via the group_by argument. Additionally you can choose a shape_by for symbols, and label_by to place labels on the plot. Note that shape, group, and labels need to be categorical (i.e. factor) variables, whereas color can be numeric. The color palette is automatically generated, but it can be set by the user through the color_palette argument.

plot_UMAP(sce, umap_slot = "UMAP_Harmony", color_by = "individual", group_by = "disease")

plot_UMAP(sce, umap_slot = "UMAP_Harmony", color_by = "sum")

plot_Coldata()

Takes as input x and y as column names from colData(sce), with an optional color_by and group_by argument for facetting.

This function returns different plots depending on the class of the 2 colData columns selected: - if y is a numeric and x is categorical (character or factor), it returns a combined violin-boxplot with one plot per level of x.

plot_Coldata(sce, x = "individual", y = "sum") + scale_y_log10()

Additionally, if the color_by argument specifies another column, every x will be divided by levels of color_by. With the appropriate use of the x, color_by and group_by variables once an look at 3 different groupings of y at once.

plot_Coldata(sce, x = "individual", y = "sum", color_by = "disease", group_by = "cell type") + scale_y_log10()
  • if y and x are both categorical, it returns a heatmap of the confusion matrix where every value is the pairwise Jaccard index between sets for any given level pair (this is mostly useful to check for differences in clustering/annotations)

  • if y and x are both numeric, it returns a scatterplot with an optional 2D kernel density contour plot overlaid.

plot_Coldata(sce, x = "sum", y = "detected") + scale_x_log10()

plot_dots() - see below.

Parallelization

Since cellula is mostly based on R/Bioconductor packages, it offers the BiocParallel parallelization backend through its parallel_param argument. In some cases, e.g. the Seurat integration, another type of backend has to be set up separately outside of the function call.

BiocParallel parallelization is implemented where possible, i.e. all the steps of the pipeline where it is sensible to use them (PCA, clustering, integration…).

sce <- integrateSCE(sce, batch = "individual", method = "fastMNN", parallel_param = MulticoreParam(workers = 4))

Clustering

cellula also offers a wrapper around clustering functions. For now, only SNN-based Louvain and Leiden clustering are implemented. The makeGraphsAndClusters() function allows users to do parameter sweeps along either the number of neighbors or the resolution of the clustering.

In this example we sweep along the value of the resolution parameter for a Louvain clustering. For the SNN graph constructions, edges are weighted according to the jaccard index of their shared neighbors, mimicking the Seurat graph construction and clustering procedure:

sce <- makeGraphsAndClusters(sce, k = c(0.1, 0.25, 0.5, 0.75, 1),
                             space = reducedDim(sce, "PCA_Harmony")[,1:20],
                             sweep_on = "clustering", method = "louvain", 
                             weighting_scheme = "jaccard", prefix = "SNN_",
                             verbose = TRUE)

If another integration method has been run on the same object (e.g. Seurat integration), then the clustering can be performed on that integrated space by specifying the space argument (in this case, reducedDim(sce, "PCA_Seurat")):

sce <- makeGraphsAndClusters(sce, k = c(0.1, 0.25, 0.5, 0.75, 1), 
                             space = reducedDim(sce, "PCA_Seurat")
                             sweep_on = "clustering", method = "louvain", 
                             weighting_scheme = "jaccard", prefix = "Seurat_SNN_",
                             verbose = TRUE)

The default value for space is NULL and will use the "PCA" slot from the reducedDim() accessor.

Plotting clustering results

You can visualize clustering results on the UMAP using the plot_UMAP() function, adding labels if desired:

plot_UMAP(sce, umap_slot = "UMAP_Harmony", color_by = "SNN_0.5", label_by = "SNN_0.5")

If using clustree, the clustering tree can be visualized by using the same prefix defined in makeGraphsAndClusters():

library(clustree)
clustree(sce, prefix = "SNN_")

The default arguments to the clustering wrapper include the generation of modularity and approximate silhouette scores for every clustering round. These will be stored in the metadata of the SCE, named according to the prefix, the resolution, and the "modularity_" and "silhouette_" prefixes. Silhouette and modularity can be visualized by using the dedicated functions:

plotSilhouette(sce, "SNN_0.5")

plotModularity(sce, "SNN_0.5")

Gene dot plot

You can also use the plot_dots() function to plot the popular dot-plot for marker genes.

This function takes in a SingleCellExperiment object, together with a vector of genes (matched to the rownames of the object), and a grouping variable specified by the group_by argument. Additionally, dots can be ordered by hierarchical clustering on either genes, groups, or both (set cluster_genes and/or cluster_groups to TRUE, which is the default). Colors can also be customized via the color_palette argument. Finally, the user can choose whether they want genes to be columns (format = "wide", the default) or rows (format = "tall").

# Quick and dirty marker calculation
markers = presto::wilcoxauc(sce, group_by = "SNN_0.5")
markerlist = split(markers, markers$group)

for(i in seq_len(length(markerlist))) {
  markerlist[[i]]$deltapct = markerlist[[i]]$pct_in - markerlist[[i]]$pct_out
  markerlist[[i]] = markerlist[[i]][order(markerlist[[i]]$deltapct, decreasing = TRUE),]
}

top5 = lapply(markerlist, function(x) x$feature[1:5])
markergenes =  Reduce(union, top5)

plot_dots(sce, genes = top5, group_by = "SNN_0.5")

Metaclusters

Aditionally, metaclusters can also be identified. A metacluster is a cluster of clusters obtained by different clustering methods. Clusters across methods are linked acording to how many cells they share, and these links become edges of a graph. Then, Louvain clustering is run on the graph and the communities that are identified are metaclusters. These metaclusters show the relationship between clustering methods. Moreover, they can be used to understand cluster stability along different parameters and/or integration methods. A cell can belong to different clusters according to the clustering method (i.e. to the resolution or to the space that was used). If a cell belongs to clusters that are consistently included in a metacluster, then that cell belongs to a “stable” cluster. If instead the cell belongs to clusters that have different metacluster assignments, then it’s in an “unstable” position, meaning it may be clustered differently according to integration methods and/or resolutions.

The metaClusters() function takes a clusters argument, which is a vector of column names from the colData of the SCE where clustering results are stored. In this example it is easy to isolate by using grep() and searching for the prefix “SNN_”.

clusterlabels <- colnames(colData(sce))[grep("SNN_", colnames(colData(sce)))]
sce <- metaCluster(sce, clusters = clusterlabels)

Every cell will belong to a series of clusters, which in turn belong to a metacluster. For every cell, we count how many times they are assigned to a particular metacluster, and the maximum metacluster is assigned, together with a “metacluster score” (i.e. the frequency of assignment to the maximum metacluster) and whether this score is above or below a certain threshold (0.5 by default). These columns are saved in the colData slot of the SCE.

plot_UMAP(sce, umap_slot = "UMAP_Harmony", color_by = "metacluster_score", label_by = "SNN_0.5")

Assigning cell identities

cellula implements 4 methods for automated cell identity assignment, based on the Bioconductor AUCell package, the GSVA ssGSEA implementation, the Seurat AddModuleScore() function or the UCell method.

The function requires a genesets named list containing genes to be used for scoring every single cell. These can be obtained through other packages, e.g. msigdbr. For instance, if we wanted to take all the Muraro et al. signature genes, present in the C8 collection, we would do:

library(msigdbr)

type_genes = msigdbr("Homo sapiens", category = "C8")

genesets = lapply(split(type_genes, type_genes$gs_name), function(x) x$gene_symbol)

muraro_genes = genesets[grep("MURARO", names(genesets))]

Then, we would use the assignIdentities() function from cellula to calculate signature scores:

sce = assignIdentities(sce, 
                       genesets = muraro_genes, 
                       method = "AUC")

Other methods are “Seurat”, “UCell”, and “ssGSEA”.

This will create a column named “labels_AUC” (or anything else the user determines using the name argument) in the colData(sce). Assignments can be plotted:

plot_UMAP(sce, umap_slot = "UMAP_Harmony", color_by = "labels_AUC")

You can now see why it can be useful to plot a confusion matrix as a heatmap:

plot_Coldata(sce, x = "SNN_0.5", y = "labels_AUC")

You can also use single signatures as an input, which will result in adding the score to the colData slot of the SCE directly, rather than an assignment:

sce <- assignIdentities(sce, 
                        genesets = muraro_genes$BETA_CELL, 
                        method = "AUC", 
                        name = "Beta_Cell_signature")

plot_UMAP(sce, umap_slot = "UMAP_Harmony", color_by = "Beta_Cell_signature")

The "UCell" method works well when you have small signatures (e.g. even 2/3 genes). It allows you to specify positive and negative labels, which is useful when you are sure the identity of a cell types depends on the lack of expression of certain markers (see hematopoietic lineages). To do so, you can add “+” or “-” to each gene.

Downsampling

There are two ways to downsample data in cellula: downsampling reads and downsampling cells.

The first approach simulates reads randomly sampling counts from a distribution with a fixed total number, using a vector of probabilities equivalent to the per-gene proportion of reads within each cell. Briefly, let’s consider a cell C in which genes a, b, and c have been quantified with 50, 30, 20 counts each (totaling to 100 counts). This is equivalent to a bag of marbles in which the probability of randomly picking an a marble is 50/100 = 0.5, b marble is 0.3, and c marble is 0.2. If we want to downsample C to a total of 40 counts (yielding the downsampled C’), we randomly pick 40 counts from a (0.5, 0.3, 0.2) vector of probabilities. This is a sort of downsampling by simulation and is described in Scott Tyler’s work, reimplemented in cellula with a slightly faster optimization.

The downsampleCounts() uses a minimum count number that is user-defined (or the minimum total count number in the dataset as a default) and returns a SingleCellExperiment object with the same number of cells as the input, and a down-sampled count matrix where each cell has the same total number of counts.

The second approach randomly select cells from within groups such as clusters, batches, or a combination of the two. Cells are randomly selected so that they represent a user-defined fraction of the within-group total, with some lower bound to ensure that small groups are represented: if a rare cluster label only contains 9 cells and we want to downsample a dataset to 10%, we can cap the minimum to 5 cells so that we ensure the rare label is still adequately represented.

The downsampleCells() function returns a SingleCellExperiment object with fewer cells than the input, as defined by the proportion and min parameters.

Inferring trajectories

At the time of writing cellula only implements a wrapper around the slingshot method for pseudotemporal trajectory inference, and the testPseudotime() method from TSCAN for differential expression along a trajectory.

The findTrajectories() function takes a SingleCellExperiment object as input, and requires the user to specify the cluster label which will be used as an input to the MST creation in slingshot. Other important parameters are space - the reduced dimensional reduction in which trajectories will be estimated (default is “PCA”), start - the starting cluster for trajectory estimation, defaulting to “auto” for an entropy-based method, omega - whether or not to use a synthetic cluster to estimate disjointed trajectories, as detailed in ?slingshot::getLineages, and doDE - whether or not to perform differential expression on every trajectory.